======================================================================

by RUI ZHANG December 05,2017

Introduction

This is a data exploration of 2016 US presidential campaign contributions in New York State. The dataset is available at http://classic.fec.gov/disclosurep/pnational.do and includes the financial information disclosed by presidential candidates for their campaigns, together with the individual contributions they reported.

The 2016 presidential election unexpectedly brought Trump into the White House, and Hillary missed the chance to become the first female president of the US. Many people assumed they would wake up to a female president the next morning, so the reversal took them by surprise. So, what can the campaign finance data tell us? I will explore the nature of the campaign contributions and see whether there are any interesting relationships in the data, such as:

  • Which party and candidates got the most support?

  • What’s the difference between the supporters of different parties?

  • How are the donations distributed spatially?

  • Is there any model that we could build to make some predictions?

Let’s load the data and the required packages.

# Load all of the packages that you end up using in your analysis in this code
# chunk.

library(ggplot2)
library(dplyr)
library(psych)
library("gridExtra")
library(zipcode)
library(choroplethrZip)
library(choroplethr)
library(devtools)
library(choroplethrMaps)
library(gender)
library(lubridate) # for the year() function
library(cowplot)   # controls the layout and size of plots
library(polycor)   # for hetcor()

getwd()
## [1] "/Users/apple/Desktop/project/Udacity/data_analyst/data_exploration_analysis/project/Financial-Data-Exploration-on-2016-President-Campaign-in-NY"
# Load the Data
campaign<-read.csv("P00000001-NY.csv",row.names = NULL)
names(campaign)
# subset the dataset - clearly state the variables used
#tmpdata <- subset(campaign, select = -c(varname1, varname2))
#tmpdata <- campaign[ , !names(campaign) %in% c("varname1", "varname2")]
dim(campaign)
str(campaign)
head(campaign)
#View(campaign)

Description of Data Set: This dataset has 649460 records with 18 variables. Most variables are categorical. The dataset includes the candidate each contributor supported, the contributor’s name, location, and occupation, and the amount contributed in New York.

Univariate Plots Section

Summary

str(campaign)
## 'data.frame':    649460 obs. of  18 variables:
##  $ cmte_id          : Factor w/ 25 levels "C00458844","C00500587",..: 6 6 7 7 6 7 7 6 6 15 ...
##  $ cand_id          : Factor w/ 25 levels "P00003392","P20002671",..: 1 1 12 12 1 12 12 1 1 23 ...
##  $ cand_nm          : Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 4 20 20 4 20 20 4 4 23 ...
##  $ contbr_nm        : Factor w/ 119407 levels " BLACKMORE, ANDI POTAMKIN",..: 52356 19422 54646 61896 8862 63027 63080 54730 55289 91553 ...
##  $ contbr_city      : Factor w/ 2327 levels ""," BROOKLYN",..: 1424 287 1424 269 1620 1406 1644 1424 595 1424 ...
##  $ contbr_st        : Factor w/ 1 level "NY": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : Factor w/ 69028 levels "","`1136","0",..: 7135 64432 5890 40174 58413 30757 49101 5353 62817 69028 ...
##  $ contbr_employer  : Factor w/ 39302 levels ""," ENGENDERHEALTH",..: 23091 30023 24570 24031 16445 24882 33393 30988 23091 16445 ...
##  $ contbr_occupation: Factor w/ 17219 levels ""," ADMINISTRATIVE ASSISTANT",..: 13288 1249 10332 16184 7896 15445 12064 16984 13288 7896 ...
##  $ contb_receipt_amt: num  100 67 50 15 100 ...
##  $ contb_receipt_dt : Factor w/ 697 levels "1-Apr-15","1-Apr-16",..: 137 367 622 577 71 599 599 229 656 22 ...
##  $ receipt_desc     : Factor w/ 27 levels ""," SEE REATTRIBUTION",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 2 2 1 1 2 1 1 2 2 2 ...
##  $ memo_text        : Factor w/ 262 levels ""," SEE REATTRIBUTION",..: 33 33 5 5 33 5 5 33 33 1 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 2 2 1 1 2 1 1 2 2 2 ...
##  $ file_num         : int  1091718 1091718 1077404 1077404 1091718 1077404 1077404 1091718 1091718 1146165 ...
##  $ tran_id          : Factor w/ 648415 levels "A00136D649BB94F13985",..: 275346 275731 533152 531094 275039 532242 531783 275424 274830 450800 ...
##  $ election_tp      : Factor w/ 6 levels "","G2016","O2016",..: 5 5 5 5 5 5 5 5 5 2 ...
summary(campaign)
##       cmte_id            cand_id                            cand_nm      
##  C00575795:399522   P00003392:399522   Clinton, Hillary Rodham  :399522  
##  C00577130:174564   P60007168:174564   Sanders, Bernard         :174564  
##  C00580100: 36931   P80001571: 36931   Trump, Donald J.         : 36931  
##  C00574624: 16785   P60006111: 16785   Cruz, Rafael Edward 'Ted': 16785  
##  C00573519:  6638   P60005915:  6638   Carson, Benjamin S.      :  6638  
##  C00458844:  4813   P60006723:  4813   Rubio, Marco             :  4813  
##  (Other)  : 10207   (Other)  : 10207   (Other)                  : 10207  
##              contbr_nm             contbr_city     contbr_st  
##  BODNICK, KATIE   :  1326   NEW YORK     :206993   NY:649460  
##  BRUN, GINA       :   413   BROOKLYN     : 86953              
##  BRONER, NAHAMA   :   318   BRONX        : 14102              
##  SCHWARTZ, HILARY :   311   ROCHESTER    :  9985              
##  KILLORIN, MICHAEL:   310   STATEN ISLAND:  7431              
##  GRODY, GORDON    :   307   BUFFALO      :  6196              
##  (Other)          :646475   (Other)      :317800              
##      contbr_zip          contbr_employer               contbr_occupation 
##  100015704:  1332   N/A          : 82294   RETIRED              : 98667  
##  10024    :  1197   SELF-EMPLOYED: 67025   NOT EMPLOYED         : 47994  
##  10022    :   871   RETIRED      : 42105   ATTORNEY             : 26486  
##  10023    :   864   NONE         : 32333   INFORMATION REQUESTED: 16956  
##  10025    :   829   NOT EMPLOYED : 21881   TEACHER              : 15080  
##  10128    :   745   (Other)      :403497   (Other)              :444199  
##  (Other)  :643622   NA's         :   325   NA's                 :    78  
##  contb_receipt_amt   contb_receipt_dt                      receipt_desc   
##  Min.   :  -10100   6-Nov-16 :  6125                             :641311  
##  1st Qu.:      15   31-Oct-16:  5924   Refund                    :  5994  
##  Median :      27   26-Sep-16:  5922   REDESIGNATION TO GENERAL  :   406  
##  Mean   :     264   2-Nov-16 :  5908   REDESIGNATION FROM PRIMARY:   404  
##  3rd Qu.:     100   4-Nov-16 :  5814   REATTRIBUTION FROM SPOUSE :   221  
##  Max.   :12777706   31-Mar-16:  5768   REATTRIBUTION TO SPOUSE   :   221  
##                     (Other)  :613999   (Other)                   :   903  
##  memo_cd                                  memo_text       form_tp      
##   :541113                                      :398114   SA17A:537819  
##  X:108347   * EARMARKED CONTRIBUTION: SEE BELOW:170844   SA18 :105647  
##             * HILLARY VICTORY FUND             : 75710   SB28A:  5994  
##             *BEST EFFORTS UPDATE               :   773                 
##             * HILLARY ACTION FUND              :   572                 
##             REDESIGNATION TO GENERAL           :   406                 
##             (Other)                            :  3041                 
##     file_num             tran_id       election_tp   
##  Min.   :1003942   SA17A.4846:     3        :   690  
##  1st Qu.:1079445   C10000499 :     2   G2016:271367  
##  Median :1104813   C10000663 :     2   O2016:   237  
##  Mean   :1105477   C10091902 :     2   P2015:     1  
##  3rd Qu.:1133832   C1013282  :     2   P2016:377162  
##  Max.   :1146285   C1014914  :     2   P2020:     3  
##                    (Other)   :649447

To understand the presidential campaign data more fully, I add two more datasets to it: the zipcode data (geographic coordinates) and the ZIP code demographics data.

# geographic data
#data(zip.map)
#str(zip.map)
data(zipcode)
str(zipcode)
## 'data.frame':    44336 obs. of  5 variables:
##  $ zip      : chr  "00210" "00211" "00212" "00213" ...
##  $ city     : chr  "Portsmouth" "Portsmouth" "Portsmouth" "Portsmouth" ...
##  $ state    : chr  "NH" "NH" "NH" "NH" ...
##  $ latitude : num  43 43 43 43 43 ...
##  $ longitude: num  -71 -71 -71 -71 -71 ...
# demographic data
data(df_zip_demographics)
str(df_zip_demographics)
## 'data.frame':    33120 obs. of  9 variables:
##  $ region           : chr  "00601" "00602" "00603" "00606" ...
##  $ total_population : num  18450 41302 53683 6591 28963 ...
##  $ percent_white    : num  1 4 2 0 1 0 0 1 2 0 ...
##  $ percent_black    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ percent_asian    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ percent_hispanic : num  99 94 96 100 99 100 100 99 98 100 ...
##  $ per_capita_income: num  7380 8463 9176 6383 7892 ...
##  $ median_rent      : num  285 319 252 230 334 315 285 338 400 319 ...
##  $ median_age       : num  36.6 38.6 38.9 37.3 39.2 38.5 40.9 36.2 42 39.7 ...

Data Manipulation

Before any preprocessing, I first select the features that are of interest or that might help with the processing and analysis.

Feature Selection

# Clearly state the names of column
#names(campaign)
campaign <- campaign[ , !names(campaign) %in% c("cmte_id", "cand_id","receipt_desc", "memo_cd", "memo_text", "form_tp", "file_num", "tran_id", "election_tp")]
#campaign<-campaign[,1:11]
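# log() returns NaN for negative amounts (refunds); hist() drops non-finite values
hist(log(campaign$contb_receipt_amt), main = "Histogram of the 2016 Presidential Contributions (log scale)")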

Dataset Issue: This dataset includes 649460 obs. of 18 variables, 15 of which are categorical. There are missing values in contbr_zip, contbr_employer, contbr_occupation, and election_tp. There are also inconsistencies in contbr_employer and in the contributor city names (contbr_city) that need to be preprocessed. As there is no party info for each candidate, I will add a new variable for that. In addition, contb_receipt_amt has some negative values, as the minimum of -10100 in the summary above shows, and I will remove those values.

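So, in general, the preprocessing adds a party variable for each candidate, removes the negative contribution amounts, and derives the date, name, and location variables used below.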

Next, I clean the ZIP codes down to 5 digits so that they can be joined to the zipcode and demographic datasets.
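The ZIP cleaning step could look roughly like the sketch below. It is a minimal illustration, not the code used for this report: `clean_zip5` and `zip_lookup` are hypothetical names, the coordinates in the toy lookup table are illustrative only, and the sketch assumes that a ZIP+4 value simply keeps its first five digits.

```r
# Hypothetical helper: reduce raw ZIP strings to 5 digits, NA when unusable.
clean_zip5 <- function(zip) {
  zip <- gsub("[^0-9]", "", as.character(zip))  # drop stray characters such as "`"
  zip5 <- substr(zip, 1, 5)                      # ZIP+4 values keep the first 5 digits
  zip5[nchar(zip5) != 5] <- NA                   # too short to be a valid ZIP code
  zip5
}

raw <- c("100015704", "10024", "`1136", "0", "")
clean_zip5(raw)

# Toy stand-in for the zipcode lookup table (coordinates are illustrative only)
zip_lookup <- data.frame(zip = c("10001", "10024"),
                         latitude = c(40.75, 40.79),
                         longitude = c(-74.00, -73.97),
                         stringsAsFactors = FALSE)

toy <- data.frame(contbr_zip = raw, stringsAsFactors = FALSE)
toy$zip <- clean_zip5(toy$contbr_zip)

# Left join: rows without a matching ZIP code keep NA coordinates
merged <- merge(toy, zip_lookup, by = "zip", all.x = TRUE)
merged
```

The same join pattern would apply to the demographic table, whose key column is named region rather than zip.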

Finally, I will add gender info to the dataset based on the contributor’s first name. Not every contributor gets a gender prediction, as a first name might be abbreviated or non-traditional.
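Before a gender can be predicted, the first name has to be separated from contbr_nm, which is stored as "LAST, FIRST". The sketch below shows one way to do that in base R; `split_name` is a hypothetical helper, and the gender lookup itself (for example with the gender package loaded earlier) is not shown here.

```r
# Hypothetical helper: split "LAST, FIRST MIDDLE" into last and first name.
split_name <- function(nm) {
  nm <- trimws(as.character(nm))
  last <- trimws(sub(",.*$", "", nm))        # text before the first comma
  rest <- trimws(sub("^[^,]*,?", "", nm))    # text after the first comma
  first <- sub("[[:space:]].*$", "", rest)   # first token of the remainder
  first[first == ""] <- NA                   # no usable first name
  data.frame(contbr_LastName = last, contbr_FirstName = first,
             stringsAsFactors = FALSE)
}

name_parts <- split_name(c("BODNICK, KATIE", " BLACKMORE, ANDI POTAMKIN", "SMITH"))
name_parts
```

Names without a comma end up with a missing first name, which is one reason some contributors cannot be assigned a gender.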

Summary: After processing the data, the dataset has the following additional variables:

  1. date_upto_elec: number of days before the final presidential election result.

  2. party: the candidate’s political party affiliation.

  3. contbr_FirstName: the contributor’s first name, split from the contributor name.

  4. contbr_LastName: the contributor’s last name, split from the contributor name.

  5. cand_FirstName: the candidate’s first name, split from the candidate name.

  6. cand_LastName: the candidate’s last name, split from the candidate name.

  7. longitude: the longitude of the contributor’s location.

  8. latitude: the latitude of the contributor’s location.

  9. total_population: the population of the corresponding ZIP code region.

  10. percent_white: the percentage of white residents in the corresponding ZIP code region.

  11. percent_black: the percentage of black residents in the corresponding ZIP code region.

  12. percent_asian: the percentage of Asian residents in the corresponding ZIP code region.

  13. percent_hispanic: the percentage of Hispanic residents in the corresponding ZIP code region.

  14. per_capita_income: the income per capita of the corresponding ZIP code region.

  15. median_rent: the median rent of the corresponding ZIP code region.

  16. median_age: the median age of the corresponding ZIP code region.

  17. gender: the predicted gender of the contributor.
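As an illustration of how a variable such as date_upto_elec can be derived, the sketch below parses receipt dates in the format seen in contb_receipt_dt and counts the days to the election. It assumes the general election day, 8 November 2016, as the reference date, and an English locale for the month abbreviations.

```r
# Parse dates such as "6-Nov-16"; %b relies on English month abbreviations.
receipt_dt <- as.Date(c("6-Nov-16", "31-Mar-16", "1-Apr-15"), format = "%d-%b-%y")

# Assumption: the general election day is used as the reference date.
election_day <- as.Date("2016-11-08")

# Days remaining until the election (negative for receipts after election day)
date_upto_elec <- as.numeric(election_day - receipt_dt)
date_upto_elec
```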

Univariate Plot Analysis

With the dataset preprocessed, we can start the univariate plot analysis. In general terms, Hillary was the candidate who received the most contributions, and the Democrats received more contributions than the other parties in New York. The city with the most donations is New York, and retired people made up a large share of the donors in the 2016 campaign. The mean contribution in the 2016 presidential campaign is about 274 dollars.

For the most important field, the contribution amount, we can see that most contributions are below 1000 dollars and that the distribution is right-skewed.
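A quick numeric check of right skew is to compare the mean with the median and to bin the amounts on a log scale. The sketch below uses made-up amounts for illustration only; they are not taken from the dataset.

```r
# Toy amounts for illustration only (not taken from the dataset)
amt <- c(5, 15, 15, 27, 27, 50, 100, 100, 250, 2700)

# A mean far above the median indicates a long right tail
c(mean = mean(amt), median = median(amt))

# Bin the amounts on a log10 scale without drawing the plot
h <- hist(log10(amt), breaks = seq(0, 4, by = 0.5), plot = FALSE)
h$counts
```

On the log scale the few very large contributions no longer dominate the histogram, which is why the report plots the amount that way.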

Now, let’s check the dates of those contributions.

Looking at the dates of the donations, there were more donations in 2016 than in earlier years, and people donated fairly steadily except during the holiday season.

In the 2016 presidential campaign data, 5 candidates are Democrats, 3 belong to third parties, and all the others are Republicans.
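Since the file carries no party column, the party has to be assigned from the candidate name. The sketch below is one possible way to do it. The last-name lists are an assumption based on the 2016 field of candidates and should be checked against levels(campaign$cand_nm); `assign_party` and the label "third party" are hypothetical.

```r
# Assumption: last names of the Democratic and third-party candidates in the file;
# verify against levels(campaign$cand_nm) before relying on these lists.
democrats   <- c("Clinton", "Sanders", "O'Malley", "Webb", "Lessig")
third_party <- c("Johnson", "Stein", "McMullin")

# Hypothetical helper: everyone not listed above is treated as Republican.
assign_party <- function(cand_nm) {
  last <- trimws(sub(",.*$", "", as.character(cand_nm)))
  party <- ifelse(last %in% democrats, "democrat",
           ifelse(last %in% third_party, "third party", "republican"))
  factor(party, levels = c("democrat", "republican", "third party"))
}

party <- assign_party(c("Clinton, Hillary Rodham", "Sanders, Bernard",
                        "Trump, Donald J.", "Bush, Jeb"))
table(party)
```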

As we can see, most contributions went to the Democrats, as New York is a strongly Democratic state.

Next, for the large part of the dataset that describes the contributors, I will go through the name, location, gender, and occupation info to find out what characterizes each party’s supporters.

First, name info. There are 119407 distinct contributor names among 649460 records, an average of about 5.4 contributions per name, so a typical contributor made only a few contributions. Individual names are therefore not of interest in this analysis.

# First, Name info
length(table(campaign$contbr_nm))
mean(table(campaign$contbr_nm))

Then, location info. The city names contain many discrepancies and would take time to clean up. To keep the preprocessing manageable, I use the ZIP code, truncated to 5 digits, to represent the location.

length(table(campaign$contbr_city))#2327
table(campaign$contbr_city)
length(table(campaign$contbr_st)) #1
length(table(campaign$contbr_zip))#69031
#library(ggthemes)
#scale_colour_tableau()

As we can see, most donors were from New York City.

The demographic features of a ZIP code region are likely to be closely related to the contributors who live there. Most donors come from regions with a higher percentage of white residents, a median age between 25 and 50, and a per capita income between 20000 and 40000 dollars.

For the gender info, there are more female contributors than male contributors in New York State, but the difference is not that large. What is the reason behind that? Is the difference significant?

For the occupation info, I show the top 25 occupations by number of contributions in the 2016 campaign. From the plot, we can observe several big categories: retired, not employed, self-employed, and employed. Within the employed group, lawyers, professors, CEOs, and doctors were the most actively involved.

In general, employed, retired, and unemployed people supported the presidential campaign most actively, in that order. Among the employed, lawyers, professors, and CEOs are the most active, and lawyers and CEOs donate the most.
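Selecting the most frequent occupations can be done with a sorted frequency table. The sketch below uses a toy vector and a cutoff of 2 so the result stays readable; the report uses the top 25.

```r
# Toy occupation values for illustration only
occupation <- c("RETIRED", "ATTORNEY", "RETIRED", "TEACHER", "ATTORNEY", "RETIRED")

top_n <- 2   # the report uses 25

# Count each occupation, sort by frequency, and keep the most common ones
top_occ <- head(sort(table(occupation), decreasing = TRUE), top_n)
top_occ
```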

The employer info has not been cleaned, and a single employer often appears under several names. Cleaning all of it would take considerable work, so I focus only on the top 25 employers and merge their duplicates.

As with the occupation info, there are four big categories: unemployed, self-employed, retired, and employed, where the employed category is made up of all the companies in the dataset. In this section, I plot the four big categories and then the companies in the employed category with the most donations. As we can see, people at several universities were actively involved in contributing to the presidential campaign, followed by private foundations, government agencies, and some big technology companies.
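Collapsing the raw employer strings into the four big categories could be sketched as follows. `employer_group` is a hypothetical helper, and the mapping (for example, treating "NONE" as unemployed and "N/A" as unknown) is an assumption rather than the rule used in the report.

```r
# Hypothetical helper: collapse employer strings into four broad groups.
employer_group <- function(employer) {
  e <- toupper(trimws(as.character(employer)))
  grp <- rep("employed", length(e))            # default: a named employer
  grp[e %in% c("SELF-EMPLOYED", "SELF EMPLOYED", "SELF")] <- "self-employed"
  grp[e %in% "RETIRED"] <- "retired"
  grp[e %in% c("NOT EMPLOYED", "UNEMPLOYED", "NONE")] <- "unemployed"
  # Assumption: these values carry no employer information at all
  grp[is.na(employer) | e %in% c("", "N/A", "INFORMATION REQUESTED")] <- NA
  grp
}

emp <- c("N/A", "SELF-EMPLOYED", "RETIRED", "NONE", "NOT EMPLOYED",
         "NEW YORK UNIVERSITY")
emp_grp <- employer_group(emp)
table(emp_grp, useNA = "ifany")
```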

Question Answering

What is the structure of your dataset?

This dataset includes 649460 observations of 18 variables, and most of the variables (15) are categorical.

What is/are the main feature(s) of interest in your dataset?

From what I explored above, the main questions that I am interested in are:

  • What’s the difference between the parties’ supporters?

  • Which party will the contributor donate to based on their background info?

  • What could impact the contribution amount?

  • What’s the best time to get the donations and where are those huge financial support coming from?

What other features in the dataset do you think will help support your analysis?

  • Candidate’s name info: I can derive the party from the candidate’s name, which helps us understand the overall support for each political party in New York.

  • Contributor’s career info: Although it is very varied, it still carries important information, such as which industry the contributors work in and roughly how much they earn, which might be part of the reason they donate to a certain party.

  • Contributor’s location info: It provides the spatial distribution of the contributors, which might differ between parties.

  • Contributor’s name info: Although most people do not donate continuously during the campaign period, we can guess their gender from the name alone, which is useful for looking into the gender distribution of the NY campaign contributions.

  • ZIP code demographics info: It helps us understand the contributors as well as the donations they made.

Did you create any new variables from existing variables in the dataset?

  • Candidates’ party info.

  • Contributors’ gender info.

  • Demographic info: including the total population, the income level, the rent level, and the race percentages.

Of the features you investigated, were there any unusual distributions?

  • The contb_receipt_amt distribution contains some negative values, which is unusual because a donation should be positive.

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The dataset has inconsistencies in contbr_employer, contbr_occupation, and the contributors’ city names that need preprocessing. As the location info is too messy, I dropped that variable and used the zipcode library to clean up and represent the location. The occupation and employer info is too varied, so I visualize only the top 25 values and group them into several big categories: unemployed, retired, self-employed, and employed.

Bivariate Plots Section

First, check out the correlation between numerical variables.

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                   contb_receipt_amt date_upto_elec total_population
## contb_receipt_amt                 1        Pearson          Pearson
## date_upto_elec               0.1567              1          Pearson
## total_population           0.006446       -0.03311                1
## latitude                   -0.08131        0.06907          -0.4263
## longitude                   0.04613       -0.04045           0.1552
## percent_white               0.04733        0.02832          -0.4823
## percent_black               -0.0539       -0.01749           0.2398
## percent_asian               0.04005        -0.0257           0.2655
## percent_hispanic           -0.05032       -0.01312           0.4094
## per_capita_income            0.1868       -0.05334          0.00238
## median_rent                   0.153       -0.06877            0.158
## median_age                  0.01807       0.002415          -0.4075
## party                        0.1051         0.1376          -0.2007
## gender                      0.04275        0.09999         0.003298
##                     latitude longitude percent_white percent_black
## contb_receipt_amt    Pearson   Pearson       Pearson       Pearson
## date_upto_elec       Pearson   Pearson       Pearson       Pearson
## total_population     Pearson   Pearson       Pearson       Pearson
## latitude                   1   Pearson       Pearson       Pearson
## longitude            -0.6452         1       Pearson       Pearson
## percent_white         0.4226   -0.2156             1       Pearson
## percent_black        -0.1285   0.02713       -0.7007             1
## percent_asian        -0.3599    0.1683       -0.3351       -0.1387
## percent_hispanic     -0.3684    0.2399       -0.7571        0.2246
## per_capita_income    -0.3753    0.2334        0.3374       -0.3364
## median_rent           -0.671     0.483       0.09245       -0.2376
## median_age            0.1715   0.02231        0.5211       -0.3594
## party                 0.1442  -0.09026        0.2111       -0.1272
## gender            -0.0003517  -0.01185      -0.05763       0.03318
##                   percent_asian percent_hispanic per_capita_income
## contb_receipt_amt       Pearson          Pearson           Pearson
## date_upto_elec          Pearson          Pearson           Pearson
## total_population        Pearson          Pearson           Pearson
## latitude                Pearson          Pearson           Pearson
## longitude               Pearson          Pearson           Pearson
## percent_white           Pearson          Pearson           Pearson
## percent_black           Pearson          Pearson           Pearson
## percent_asian                 1          Pearson           Pearson
## percent_hispanic         0.1015                1           Pearson
## per_capita_income        0.1446          -0.3251                 1
## median_rent               0.272         -0.08536            0.8205
## median_age              -0.1945          -0.3729            0.1778
## party                   -0.1123          -0.1412           -0.1372
## gender                  0.02402          0.04885          -0.05253
##                   median_rent median_age      party     gender
## contb_receipt_amt     Pearson    Pearson Polyserial Polyserial
## date_upto_elec        Pearson    Pearson Polyserial Polyserial
## total_population      Pearson    Pearson Polyserial Polyserial
## latitude              Pearson    Pearson Polyserial Polyserial
## longitude             Pearson    Pearson Polyserial Polyserial
## percent_white         Pearson    Pearson Polyserial Polyserial
## percent_black         Pearson    Pearson Polyserial Polyserial
## percent_asian         Pearson    Pearson Polyserial Polyserial
## percent_hispanic      Pearson    Pearson Polyserial Polyserial
## per_capita_income     Pearson    Pearson Polyserial Polyserial
## median_rent                 1    Pearson Polyserial Polyserial
## median_age           0.004418          1 Polyserial Polyserial
## party                 -0.1394     0.2104          1 Polychoric
## gender               -0.03283   -0.05711     0.2783          1

With the univariate analysis done, I will now check the distribution of contributions across parties, candidates, genders, and occupations.

Contribution by Parties

From the first part, we know that more people donated to the Democratic party. In the 2016 presidential campaign, the Democrats clearly beat the other two groups in New York State in terms of money raised. Most people donated around 100 dollars to both the Democratic and the Republican party, and people tended to donate more to third parties, although fewer people donated to them. For the mean contribution amount, the ranking was third party > Republican > Democrat, and the interquartile range (IQR) has the same ranking.

For gender, more female donors supported the Democratic party, whereas in the Republican party male donors clearly dominate the contributions and tend to donate more than donors to the Democratic party. Both the mean contribution amount and the IQR are larger for males than for females, and the difference is statistically significant.

## $female
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.04    15.00    27.00   120.90    75.00 10800.00 
## 
## $male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.1    15.0    30.0   148.8   100.0 10800.0
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  contb_receipt_amt by gender
## W = 3.3681e+10, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
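The per-gender summary and the rank sum test shown above can be produced with base R alone. The sketch below runs on simulated amounts (illustrative only), so its numbers will not match the report.

```r
set.seed(1)  # simulated data for illustration only
toy <- data.frame(
  gender = rep(c("female", "male"), each = 50),
  contb_receipt_amt = c(rlnorm(50, meanlog = 3.3), rlnorm(50, meanlog = 3.6))
)

# Summary statistics of the contribution amount for each gender
tapply(toy$contb_receipt_amt, toy$gender, summary)

# Two-sided Wilcoxon rank sum test for a location shift between the groups
wt <- wilcox.test(contb_receipt_amt ~ gender, data = toy)
wt$p.value
```

A rank-based test is a reasonable choice here because the amounts are heavily right-skewed, which makes a comparison of means sensitive to a few very large donations.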

Contribution by Candidates

From the general plots, the top four candidates by number of donations are Hillary Clinton, Bernard Sanders, Donald Trump, and Ted Cruz. The boxplot also shows that most of the donations they received were below 500 dollars.

Looking at the candidate behind each donation, Hillary raised the most money in the 2016 presidential campaign, followed by Bernard and Trump.

## # A tibble: 25 x 6
## # Groups:   cand_nm [25]
##                      cand_nm      party         sum      mean median
##                       <fctr>     <fctr>       <dbl>     <dbl>  <dbl>
##  1   Clinton, Hillary Rodham   democrat 145396196.3  373.4629   30.0
##  2          Sanders, Bernard   democrat   8395372.6   48.7556   27.0
##  3          Trump, Donald J. republican   5853323.2  165.6617   65.7
##  4                 Bush, Jeb republican   3709338.3 1623.3428 2700.0
##  5              Rubio, Marco republican   3009977.8  682.5346  100.0
##  6 Cruz, Rafael Edward 'Ted' republican   2055960.2  127.9935   45.0
##  7  Christie, Christopher J. republican    896712.0 1945.1453 2700.0
##  8           Kasich, John R. republican    888243.2  673.9327  250.0
##  9       Carson, Benjamin S. republican    709442.6  108.1303   50.0
## 10        Graham, Lindsey O. republican    515372.1 1758.9490 1500.0
## # ... with 15 more rows, and 1 more variables: n <int>
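A per-candidate table of this kind can be rebuilt in several ways. The tibble output above suggests dplyr was used; the sketch below is a base R equivalent on toy rows (illustrative amounts only), computing the same sum, mean, median, and count.

```r
# Toy rows for illustration only
toy <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
              "Sanders, Bernard", "Trump, Donald J.", "Trump, Donald J."),
  contb_receipt_amt = c(30, 2700, 27, 65.7, 250)
)

# One summary row per candidate
by_cand <- do.call(rbind, lapply(
  split(toy$contb_receipt_amt, toy$cand_nm),
  function(x) data.frame(sum = sum(x), mean = mean(x),
                         median = median(x), n = length(x))
))

# Largest total first, as in the table above
by_cand <- by_cand[order(-by_cand$sum), ]
by_cand
```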

Contribution by Candidate of Each Party

If we look at each candidate within a party, we can see that more women supported Hillary, which made women the majority among the Democratic contributors.

Contribution by Occupation

Overall, the Republicans received more donations from the retired group than the Democrats did, whereas the Democrats received more from the employed and unemployed groups.

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                   contb_receipt_amt   employer
## contb_receipt_amt                 1 Polyserial
## employer                   0.002845          1

By occupation, homemakers, CEOs, real estate professionals, and lawyers had the largest mean contribution amounts as well as the largest IQRs. Among Democratic donors, the main groups were retired people, lawyers, CEOs, homemakers, and consultants, and the Republican party has similar main groups.

For the top three candidates, lawyers contributed the most to Hillary, followed by retired people. For Bernard, unemployed people were the main group of donors, and retired people were the strongest supporters of Trump.

Contribution by Employer

By employer, people working at several universities are the main donors to the Democratic party, whereas homemakers made more contributions to the Republican party. For the top 3 candidates, people at New York University and Columbia University donated the most to Hillary and Bernard, and homemakers contributed the most to Trump.

Contribution by Date

First, I check the correlation between time and contribution amount. The linear correlation is essentially zero, as the test below shows, but the trend has two peaks, and the second one falls very close to election day.

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(campaign_trend$date) and campaign_trend$contb_receipt_amt
## t = -0.37751, df = 316760, p-value = 0.7058
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.004153131  0.002811662
## sample estimates:
##          cor 
## -0.000670743
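The test above has the shape of a call to cor.test() on the numeric date and the amount. A minimal sketch on simulated data (illustrative only, so the estimate will differ from the report):

```r
set.seed(42)  # simulated data for illustration only
dates  <- as.Date("2016-01-01") + sample(0:300, 200, replace = TRUE)
amount <- rlnorm(200, meanlog = 3.3)

# Pearson's product-moment correlation is the default method of cor.test()
ct <- cor.test(as.numeric(dates), amount)
c(estimate = unname(ct$estimate), p.value = ct$p.value)
```

Pearson's correlation only measures a linear trend, so an estimate near zero does not rule out the seasonal peaks that the time series plot shows.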

If we look at each month of the year, what does the distribution look like? In general, there were peaks after the holiday months, and there were far more contributions in 2016 than in 2015.

## 
##   2013   2014   2015   2016 
##      1     43  58426 575082
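Counts per year and totals per month follow directly from the parsed dates. The sketch below uses a few toy records and assumes an English locale for the month abbreviations in contb_receipt_dt.

```r
# Toy records for illustration only; dates use the contb_receipt_dt format
dt  <- as.Date(c("6-Nov-16", "31-Oct-16", "1-Apr-15", "26-Sep-16"),
               format = "%d-%b-%y")
amt <- c(100, 27, 50, 15)

# Number of contributions per year
per_year <- table(format(dt, "%Y"))
per_year

# Total amount per calendar month
per_month <- tapply(amt, format(dt, "%Y-%m"), sum)
per_month
```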

Looking at the contribution trend of each party, all of them had a second peak close to the election, but the Republican and third-party trends fluctuated more. There was also a large increase in third-party contributions as the election approached, which suggests to me that a share of donors did not want to support either Hillary or Trump.

Finally, zooming in on the individual candidates: Hillary’s contributions showed an increasing trend, even after the election. For Trump, there was an obvious peak close to the election, followed by a sudden drop. Bernard seems to have received more donations in early 2016 but fell behind Trump close to the election.

Among the demographic features of the ZIP code regions, only income per capita and median rent show a weak correlation with the contribution amount.

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                   contb_receipt_dt percent_white percent_black
## contb_receipt_dt                 1    Polyserial    Polyserial
## percent_white            -0.002247             1       Pearson
## percent_black            -0.001827       -0.7007             1
## percent_asian             0.003847       -0.3351       -0.1387
## percent_hispanic          0.002925       -0.7571        0.2246
## per_capita_income         0.008978        0.3374       -0.3364
## median_rent                0.01014       0.09245       -0.2376
## median_age               -0.003933        0.5211       -0.3594
## contbr_year                   <NA>          <NA>          <NA>
##                   percent_asian percent_hispanic per_capita_income
## contb_receipt_dt     Polyserial       Polyserial        Polyserial
## percent_white           Pearson          Pearson           Pearson
## percent_black           Pearson          Pearson           Pearson
## percent_asian                 1          Pearson           Pearson
## percent_hispanic         0.1015                1           Pearson
## per_capita_income        0.1446          -0.3251                 1
## median_rent               0.272         -0.08536            0.8205
## median_age              -0.1945          -0.3729            0.1778
## contbr_year                <NA>             <NA>              <NA>
##                   median_rent median_age contbr_year
## contb_receipt_dt   Polyserial Polyserial  Polyserial
## percent_white         Pearson    Pearson     Pearson
## percent_black         Pearson    Pearson     Pearson
## percent_asian         Pearson    Pearson     Pearson
## percent_hispanic      Pearson    Pearson     Pearson
## per_capita_income     Pearson    Pearson     Pearson
## median_rent                 1    Pearson     Pearson
## median_age           0.004418          1     Pearson
## contbr_year              <NA>       <NA>           1

Finally, I look at the contribution amount by ZIP code on the map. As we can see, most of the contributions come from New York City, but the Republicans do receive a certain share from the rural areas of New York State.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

No feature is strongly correlated with the contribution amount. Only party, median rent and income per capita stand out, with correlations larger than 0.1.

For other features,

  • Gender: Female donors were more common in the Democratic party, especially for Hillary, whereas male donors clearly dominated the contributions to the Republican party.

  • Occupation: Republicans got more donations from the retired group than Democrats did, whereas Democrats got more from the employed and unemployed groups. The main donor groups are retired people, lawyers, CEOs, homemakers, and consultants. Among the top three candidates, lawyers contributed the most to Hillary, followed by retired people. For Bernard, unemployed people were the main donors, and retired people were Trump's strongest supporters.

  • Employer: People employed by several universities are the main donors for the Democratic party, whereas homemakers made more contributions to the Republican party. Among the top three candidates, people at New York University and Columbia University donated the most to Hillary and Bernard, and homemakers contributed the most to Trump.

  • Dates: There were peaks after the holiday months, and there were more contributions in 2016 than in 2015.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The median rent is strongly correlated with the income per capita (about 0.82), and the percentage of white residents is correlated with the percentage of black residents.

What was the strongest relationship you found?

The strongest relationship is between median rent and income per capita. For the contribution amount, the most correlated feature is income per capita, followed by median rent.

Multivariate Plots Section

Now I will pull out several correlated features, including income per capita, median rent, and party, and plot them against the contribution amount.

As we can see, people with higher income contribute more to the campaign. At the same income level, males tend to donate more, and third-party contributors tend to donate more. This is reasonable and understandable. For people with income below 2000, however, the donations varied.

People with higher median rent tend to donate more, which is reasonable and mirrors the relationship with income per capita. For people whose rent is below 500 dollars, the donations varied a little, but not by much.

In general, the contribution amount dropped off as time passed, while the number of contributions increased.

The map is useful for the parties to identify where donations came from, and it might help each party identify its potential donors.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Besides income per capita, median rent, and party, the gender and date information do strengthen the relationship with the contribution amount.

Were there any interesting or surprising interactions between features?

As either rent or income per capita increases, the contribution amount increases more for Republicans than for Democrats, which means that Republican supporters were more generous than Democratic supporters at the same income level.

As the election approached, the Republicans received more contributions than the other parties, especially close to Election Day, even in this Democrat-dominated state. My guess is that they had a nationwide strategy to win over more supporters, which may indirectly explain why Trump got into the White House.

Model Building

Here I would like to try logistic regression to predict the party a donor contributes to from their gender, income level, rent level, median age, donation amount, and number of days before the election (a way to transform the date into a numeric type).
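The "number of days before the election" feature can be derived with plain date arithmetic. The lines below are a minimal sketch of one plausible construction, not necessarily how `date_upto_elec` was built for this report. It assumes Election Day 2016 (8 November 2016) as the reference date, and the receipt dates are made up for illustration.

```r
# Assumption: Election Day 2016 (8 November 2016) is the reference date.
election_day <- as.Date("2016-11-08")

# Hypothetical receipt dates, for illustration only.
receipt_dates <- as.Date(c("2015-06-15", "2016-11-01", "2016-11-08"))

# Subtracting Dates gives a difftime; as.numeric() turns it into a day count
# that can be used directly as a numeric predictor.
date_upto_elec <- as.numeric(election_day - receipt_dates)
print(date_upto_elec)
```

With this definition, larger values mean the donation was made earlier in the campaign.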

# Logistic regression: party as a function of donation and demographic features
model <- glm(party ~ contb_receipt_amt + date_upto_elec + per_capita_income +
               median_rent + median_age + gender,
             family = binomial(link = 'logit'), data = train)
summary(model)
## 
## Call:
## glm(formula = party ~ contb_receipt_amt + date_upto_elec + per_capita_income + 
##     median_rent + median_age + gender, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5474  -0.3424  -0.2610  -0.1946   3.6024  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.644e+00  8.558e-02 -89.324   <2e-16 ***
## contb_receipt_amt  3.771e-04  1.271e-05  29.668   <2e-16 ***
## date_upto_elec     2.585e-03  6.219e-05  41.559   <2e-16 ***
## per_capita_income -4.806e-06  5.344e-07  -8.993   <2e-16 ***
## median_rent       -5.702e-04  4.545e-05 -12.545   <2e-16 ***
## median_age         1.205e-01  1.845e-03  65.319   <2e-16 ***
## gendermale         9.398e-01  1.722e-02  54.586   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 135473  on 345560  degrees of freedom
## Residual deviance: 124230  on 345554  degrees of freedom
##   (4439 observations deleted due to missingness)
## AIC: 124244
## 
## Number of Fisher Scoring iterations: 6

We can see that all the parameters we selected are statistically significant, which suggests that they all play a role in which party a donor contributes to. With all other variables held the same, the larger the contribution amount, the number of days before the election, or the median age, the more likely the donor is a Republican supporter. On the other hand, a donor with higher income or higher rent, or a female donor, is more likely to be a Democratic donor.
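Because the coefficients are on the log-odds scale, exponentiating them makes their size easier to read. The sketch below copies the estimates from the summary output above and converts them to odds ratios. It assumes, following the interpretation in this report, that the modeled outcome is "republican". The step sizes of 1,000 and 10,000 dollars are illustrative choices made here.

```r
# Coefficients copied from the summary(model) output (log-odds scale).
coefs <- c(
  contb_receipt_amt = 3.771e-04,
  date_upto_elec    = 2.585e-03,
  per_capita_income = -4.806e-06,
  median_rent       = -5.702e-04,
  median_age        = 1.205e-01,
  gendermale        = 9.398e-01
)

# exp() turns a log-odds coefficient into an odds ratio per one-unit increase.
odds_ratios <- exp(coefs)

# Odds ratios for larger, more readable steps (illustrative step sizes).
or_per_1000_dollars_donated <- exp(coefs[["contb_receipt_amt"]] * 1000)
or_per_10000_dollars_income <- exp(coefs[["per_capita_income"]] * 10000)

print(round(odds_ratios, 4))
print(c(donation_1000 = or_per_1000_dollars_donated,
        income_10000  = or_per_10000_dollars_income))
```

An odds ratio above 1 raises the odds of the modeled outcome and a value below 1 lowers them, so the tiny per-dollar coefficients become more meaningful once they are expressed per thousand dollars.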

##                     
## model_pred_direction democrat republican
##           democrat     158863      25134
##           republican      605        141
## Accuracy:  0.8606767

Did you create any models with your dataset? Discuss the strengths
and limitations of your model.

The accuracy on the test set is 0.86, which looks good for predicting political support. However, accuracy alone cannot tell the actual performance of the model, because the dataset is heavily imbalanced toward Democratic donors. In addition, the model summary shows that the coefficients of the numerical variables are quite small, so it is hard to make precise predictions if the donor information is not precisely correct.
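To make the imbalance concern concrete, the confusion matrix printed above can be re-entered by hand and compared with a majority-class baseline. This is a minimal sketch that assumes the rows of the table are the model's predictions (the row label is `model_pred_direction`) and the columns are the actual parties.

```r
# Confusion matrix copied from the output above.
# Assumption: rows = predicted party, columns = actual party.
cm <- matrix(
  c(158863, 605,    # actual democrat:   predicted democrat, predicted republican
    25134,  141),   # actual republican: predicted democrat, predicted republican
  nrow = 2,
  dimnames = list(
    predicted = c("democrat", "republican"),
    actual    = c("democrat", "republican")
  )
)

# Share of correct predictions (matches the reported accuracy).
accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy of a trivial model that always predicts the larger class.
baseline <- max(colSums(cm)) / sum(cm)

# Recall and precision for the minority class.
recall_rep    <- cm["republican", "republican"] / sum(cm[, "republican"])
precision_rep <- cm["republican", "republican"] / sum(cm["republican", ])

print(round(c(accuracy = accuracy, baseline = baseline,
              recall_rep = recall_rep, precision_rep = precision_rep), 4))
```

Comparing `accuracy` with `baseline`, and looking at the recall for Republican donors, gives a fairer picture of the model than accuracy alone.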


Final Plots and Summary

Plot One

## # A tibble: 4,665 x 7
## # Groups:   cand_nm, date [4,665]
##      cand_nm       date      party median_date average_date sum_date
##       <fctr>     <date>     <fctr>       <dbl>        <dbl>    <dbl>
##  1 Bush, Jeb 2015-06-15 republican        2700     2280.952    47900
##  2 Bush, Jeb 2015-06-16 republican         750     1233.333     7400
##  3 Bush, Jeb 2015-06-17 republican        2700     2665.000    53300
##  4 Bush, Jeb 2015-06-18 republican        2700     2705.000    54100
##  5 Bush, Jeb 2015-06-19 republican        2700     2562.162    94800
##  6 Bush, Jeb 2015-06-20 republican        2700     2700.000     2700
##  7 Bush, Jeb 2015-06-21 republican        1850     1850.000     3700
##  8 Bush, Jeb 2015-06-22 republican        2700     2607.353    88650
##  9 Bush, Jeb 2015-06-23 republican        2700     2558.333    76750
## 10 Bush, Jeb 2015-06-24 republican        2700     2675.000   117700
## # ... with 4,655 more rows, and 1 more variables: count_date <int>
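A per-candidate, per-day table like the one above can be built by grouping the contributions by candidate and date and then summarising each group. The sketch below uses base R so that it runs without extra packages; with dplyr, which this report loads, `group_by()` followed by `summarise()` would be the natural equivalent. The data frame `toy` and its values are hypothetical and only illustrate the shape of the computation.

```r
# Hypothetical toy contributions (not taken from the FEC data).
toy <- data.frame(
  cand_nm = c("A", "A", "A", "B"),
  date = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02", "2016-01-01")),
  contb_receipt_amt = c(100, 300, 50, 2700)
)

# One row per candidate and day, with median, mean, sum and count.
daily <- aggregate(
  contb_receipt_amt ~ cand_nm + date,
  data = toy,
  FUN = function(x) c(median = median(x), mean = mean(x),
                      sum = sum(x), n = length(x))
)

# aggregate() returns the statistics as a matrix column; flatten it so each
# statistic becomes an ordinary column (e.g. contb_receipt_amt.sum).
daily <- do.call(data.frame, daily)
print(daily)
```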

Description One

As we can see, the Democrats dominate the donations in New York State. In general, the campaigns received most of their donations in 2016, and there appear to be seasonal contribution peaks, especially after the summer and winter holiday seasons. All the parties have a second peak in their donation trend, but its timing differs slightly: the third parties' peak came much earlier and the Republicans' came much later. Among the top three candidates, Clinton won by a large margin in New York State. Sanders and Trump were comparable in the amount of donations, but Trump seems to have received more donations as Election Day approached.

Plot Two

Description Two

The donor’s income level and rent level, which are hidden behind the zip code they provided, are positively correlated with the contribution amount. As most donations came from New York City, it is hard to identify spatial differences. But by exploring the demographic information of those regions, we can see that Republican supporters were more generous than Democratic supporters, although Democratic supporters dominate the New York region.

Plot Three

Description Three

As we can see, the main donor groups in the New York State campaign were CEOs, homemakers, lawyers, real estate professionals, doctors, and professors. The most actively involved employers were universities, government agencies, and several technology companies.

Although the parties have different supporters, each party's supporters do have specific characteristics in terms of career information. For the Democratic party, supporters mostly came from universities, and their occupations were mostly retired, CEO, lawyer, homemaker, or consultant. For the Republican party, supporters more often listed homemaker or self-employed as their employer, and their occupations were mostly homemaker and retired.


Reflection

In this project, I set out to identify what lies behind the donation amounts and how the supporters of the different parties differ. The features most correlated with the contribution amount were party, gender, median rent, income per capita, and the date of the donation. As for the difference between the parties’ supporters, a male who donates more than average and lives in a region with lower income and lower rent is more likely to be a Republican supporter.

For the challenges and difficulties I met,

Data Preprocessing - The original dataset is a bit messy, especially the occupation, employer, and geolocation information. For the geolocation information, I used the zipcode library to clean it and used the zipcode dataset to get the longitude and latitude for each zip code. For the occupation information, I didn’t clean and categorize all of the values and only focused on the top 25 to clean, analyze, and plot.

Data Visualization - It’s really hard to lay out a clear analytical and logical process beforehand, so it takes time to try, to change, to find, and to organize.

Data Modeling - Plotting is sometimes not the most direct way to find the relationship between variables, so I spent time extracting information from the plots as well as trying to identify the relationships with statistical methods.

For the success,

Conclusion

By analyzing the campaign finance dataset, I found several interesting results:

  • New York State was dominated by Democratic supporters, who are retired people, university professors, lawyers, homemakers, and doctors.

  • Hillary got the most donations, followed by Bernard and Trump.

  • Female donors were slightly more common among Democratic supporters, and males clearly dominated the Republican donations.

  • In general, Republican supporters were more generous and donated more than Democratic supporters at the same income level.

  • Party, income level, rent level, and gender affect how much money people are willing to donate. Timing is also very important, as more people donate in the period after the holiday seasons.

  • Democratic and Republican supporters do have different characteristics, especially in their career information. Homemakers are a large part of Trump's donations, and universities are key to Hillary's donations.

Shortcomings

Future